In this project, I conducted sentiment analysis using Python on two volumes from the HathiTrust Digital Library to examine how emotional valence changes across each text. I analyzed one work of fiction, The Count of Monte Cristo by Alexandre Dumas, and one nonfiction text, The Origin of Species by Charles Darwin. My goal was to generate visualizations of the change in emotional valence over the span of each book.
I utilized two forms of textual data for my analysis: full text (TXT) files downloaded directly from HathiTrust Digital Library, as well as Extracted Features (EF) obtained from HathiTrust Research Center (HTRC) Analytics. HTRC Analytics enables non-profit research and educational uses of materials in the HathiTrust collection, including those still under copyright. Specifically, the Extracted Features contain metadata about volumes and pages alongside part-of-speech-tagged tokens and token counts extracted from full texts.
With this textual data, I performed sentiment analysis using three tools: VADER, TextBlob, and AFINN. Each tool assigns sentiment scores to input texts which can be aggregated and visualized to show how emotional valence shifts across the span of pages in each book.
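As a rough illustration of the lexicon-based approach these tools build on, here is a toy scorer with an invented four-word lexicon (purely illustrative, not the real AFINN word list):

```python
# Toy lexicon-based scorer illustrating the principle behind AFINN-style tools.
# The lexicon below is invented for this sketch, not the real AFINN word list.
toy_lexicon = {'good': 3, 'happy': 3, 'bad': -3, 'terrible': -3}

def toy_score(text):
    # Sum the valence of every known word; unknown words contribute 0
    return sum(toy_lexicon.get(word, 0) for word in text.lower().split())

print(toy_score("A good and happy day"))   # 6
print(toy_score("What a terrible storm"))  # -3
```

The real tools add refinements on top of this idea: VADER accounts for negation, punctuation, and capitalization, while TextBlob averages pattern-based polarity values rather than summing raw word scores.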
To download the full-text files for entire volumes from HathiTrust, one needs to be affiliated with a HathiTrust member institution and logged into the HathiTrust website using institutional credentials.
Full text files can be directly downloaded from each item's page. Extracted features must be accessed through HTRC Analytics, following this EF download tutorial.
For this project, I downloaded the following data files:
The Count of Monte Cristo: mdp-39015062136661-1693964099.txt (full text) and mdp.39015062136661.json.bz2 (Extracted Features)
On the Origin of Species: hvd-hw39sc-1696432701.txt (full text) and hvd.hw39sc.json.bz2 (Extracted Features)
For this project, I selected three widely used sentiment analysis tools: VADER, TextBlob, and AFINN.
These tools can be installed in a Python environment using pip commands:
pip install nltk
pip install textblob
pip install afinn
To analyze Extracted Features, the htrc-feature-reader package is also needed; it is designed specifically to work with Extracted Features from HTRC.
pip install htrc-feature-reader
Other modules needed for the project include pandas and plotly: the former for data analysis, the latter for plotting interactive graphs.
pip install pandas
pip install plotly==5.18.0
# Import libraries for data analysis and visualization
import re  # for regular expression operations
import pandas as pd
import plotly.graph_objects as go

# Import the sentiment analysis tools
import nltk
nltk.download('vader_lexicon')  # VADER's lexicon must be downloaded once before first use
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from afinn import Afinn
The full-text file contained the entire book of The Count of Monte Cristo as a single continuous stream of text. To analyze it, I needed to structure it into a more manageable format: a DataFrame with two columns, one for page numbers and the other for the content of each page.
While looking at the TXT file, I noticed markers that indicate page breaks, formatted as ## p. (#1) #################################################. I decided to use regular expressions (RegEx) to identify these markers and parse the text accordingly.
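As a quick check of the pattern used below, here is how re.match pulls the page number out of a marker-style line (the sample line is illustrative):

```python
import re

page_pattern = r'## p\. (\d+)'
sample = "## p. 12 #################################################"
match = re.match(page_pattern, sample)
print(match.group(1))  # prints 12
```

Since re.match anchors at the start of the string, only lines that begin with the marker will match, and the capture group isolates the digits of the page number.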
# Open the file and split the text into lines
with open('mdp-39015062136661-1693964099.txt', 'r', encoding='utf-8') as file:
    text = file.read()
lines = text.split('\n')

# Initialize lists for page numbers and content
page_numbers = []
page_content = []
current_page_number = None
current_page_content = []

# Parse the text with RegEx
page_pattern = r'## p\. (\d+)'
for line in lines:
    if line.startswith("## p. "):
        # A new page starts: store the page collected so far
        if current_page_number is not None:
            page_numbers.append(current_page_number)
            page_content.append(" ".join(current_page_content))
        match = re.match(page_pattern, line)
        if match:
            current_page_number = match.group(1)
            current_page_content = []
    else:
        current_page_content.append(line)

# Store the final page, which the loop above never appends
if current_page_number is not None:
    page_numbers.append(current_page_number)
    page_content.append(" ".join(current_page_content))
Next, I created a DataFrame named dumas_full_text to organize page_numbers and page_content.
dumas_full_text = pd.DataFrame({
    'page_number': [int(x) for x in page_numbers],
    'page_content': page_content,
})
Here is a preview of the DataFrame for pages 11 to 15:
dumas_full_text[10:15]
|   | page_number | page_content |
|---|---|---|
| 10 | 11 | THE COUNT OF MONTE CRISTO 11 navy—a costume s... |
| 11 | 12 | 12 THE COUNT OF MONTE CRISTO 1 piness is like... |
| 12 | 13 | THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... |
| 13 | 14 | 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... |
| 14 | 15 | THE COUNT OF MONTE CRISTO 15 “You understand ... |
With the DataFrame ready, I proceeded with sentiment analysis using the VADER tool:
analyzer = SentimentIntensityAnalyzer()
dumas_full_text['vader_sentiment_score'] = 0.0
dumas_full_text['vader_sentiment'] = ""

for row in dumas_full_text.itertuples():
    sentence = row.page_content
    sentiment_dictionary = analyzer.polarity_scores(sentence)
    compound = sentiment_dictionary['compound']
    dumas_full_text.at[row.Index, 'vader_sentiment_score'] = compound
    if compound >= 0.33:
        vader_sentiment = "Positive"
    elif compound <= -0.33:
        vader_sentiment = "Negative"
    else:
        vader_sentiment = "Neutral"
    dumas_full_text.at[row.Index, 'vader_sentiment'] = vader_sentiment
dumas_full_text[10:15]
|   | page_number | page_content | vader_sentiment_score | vader_sentiment |
|---|---|---|---|---|
| 10 | 11 | THE COUNT OF MONTE CRISTO 11 navy—a costume s... | 0.9972 | Positive |
| 11 | 12 | 12 THE COUNT OF MONTE CRISTO 1 piness is like... | 0.8093 | Positive |
| 12 | 13 | THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... | 0.8625 | Positive |
| 13 | 14 | 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... | -0.5018 | Negative |
| 14 | 15 | THE COUNT OF MONTE CRISTO 15 “You understand ... | 0.9674 | Positive |
The process for sentiment analysis with TextBlob is similar to that with VADER:
dumas_full_text['textblob_sentiment_score'] = 0.0
dumas_full_text['textblob_sentiment'] = ""

for row in dumas_full_text.itertuples():
    sentence = row.page_content
    classifier = TextBlob(sentence)
    polarity = classifier.sentiment.polarity
    dumas_full_text.at[row.Index, 'textblob_sentiment_score'] = polarity
    if polarity >= 0.1:
        textblob_sentiment = "Positive"
    elif polarity <= -0.1:
        textblob_sentiment = "Negative"
    else:
        textblob_sentiment = "Neutral"
    dumas_full_text.at[row.Index, 'textblob_sentiment'] = textblob_sentiment
dumas_full_text[10:15]
|   | page_number | page_content | vader_sentiment_score | vader_sentiment | textblob_sentiment_score | textblob_sentiment |
|---|---|---|---|---|---|---|
| 10 | 11 | THE COUNT OF MONTE CRISTO 11 navy—a costume s... | 0.9972 | Positive | 0.239049 | Positive |
| 11 | 12 | 12 THE COUNT OF MONTE CRISTO 1 piness is like... | 0.8093 | Positive | 0.108443 | Positive |
| 12 | 13 | THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... | 0.8625 | Positive | 0.090585 | Neutral |
| 13 | 14 | 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... | -0.5018 | Negative | 0.090830 | Neutral |
| 14 | 15 | THE COUNT OF MONTE CRISTO 15 “You understand ... | 0.9674 | Positive | 0.052443 | Neutral |
Lastly, I used AFINN to analyze the sentiment across the full text of The Count of Monte Cristo.
afinn = Afinn(language='en')
dumas_full_text['afinn_sentiment_score'] = 0.0

for row in dumas_full_text.itertuples():
    sentence = row.page_content
    score = afinn.score(sentence)
    dumas_full_text.at[row.Index, 'afinn_sentiment_score'] = score
dumas_full_text[10:15]
|   | page_number | page_content | vader_sentiment_score | vader_sentiment | textblob_sentiment_score | textblob_sentiment | afinn_sentiment_score |
|---|---|---|---|---|---|---|---|
| 10 | 11 | THE COUNT OF MONTE CRISTO 11 navy—a costume s... | 0.9972 | Positive | 0.239049 | Positive | 38.0 |
| 11 | 12 | 12 THE COUNT OF MONTE CRISTO 1 piness is like... | 0.8093 | Positive | 0.108443 | Positive | 12.0 |
| 12 | 13 | THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... | 0.8625 | Positive | 0.090585 | Neutral | 0.0 |
| 13 | 14 | 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... | -0.5018 | Negative | 0.090830 | Neutral | -16.0 |
| 14 | 15 | THE COUNT OF MONTE CRISTO 15 “You understand ... | 0.9674 | Positive | 0.052443 | Neutral | 0.0 |
One thing to note is that the scale of afinn_sentiment_score differs from those of vader_sentiment_score and textblob_sentiment_score. While VADER and TextBlob scores range between -1 and 1, AFINN scores are sums of the sentiment values of individual words, which results in much larger absolute values. Therefore, I needed to rescale the AFINN scores to the range of -1 to 1. (Note that min-max scaling maps the most negative page to exactly -1 and the most positive page to exactly 1; a raw score of 0 does not, in general, map to 0.)
# Normalize the AFINN sentiment scores
min_value = min(dumas_full_text['afinn_sentiment_score'])
max_value = max(dumas_full_text['afinn_sentiment_score'])
normalized_numbers = [(x - min_value) / (max_value - min_value) for x in dumas_full_text['afinn_sentiment_score']]
# Adjust the normalized numbers to the -1 to 1 range
afinn_normalized = [2 * x - 1 for x in normalized_numbers]
dumas_full_text['afinn_normalized'] = afinn_normalized
dumas_full_text[10:15]
|   | page_number | page_content | vader_sentiment_score | vader_sentiment | textblob_sentiment_score | textblob_sentiment | afinn_sentiment_score | afinn_normalized |
|---|---|---|---|---|---|---|---|---|
| 10 | 11 | THE COUNT OF MONTE CRISTO 11 navy—a costume s... | 0.9972 | Positive | 0.239049 | Positive | 38.0 | 0.525926 |
| 11 | 12 | 12 THE COUNT OF MONTE CRISTO 1 piness is like... | 0.8093 | Positive | 0.108443 | Positive | 12.0 | 0.140741 |
| 12 | 13 | THE COUNT OF MONTE CRISTO 13 elder Dantès, wh... | 0.8625 | Positive | 0.090585 | Neutral | 0.0 | -0.037037 |
| 13 | 14 | 14 THE COUNT OF MONTE CRISTO 1 1 ought to do,... | -0.5018 | Negative | 0.090830 | Neutral | -16.0 | -0.274074 |
| 14 | 15 | THE COUNT OF MONTE CRISTO 15 “You understand ... | 0.9674 | Positive | 0.052443 | Neutral | 0.0 | -0.037037 |
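As a sanity check of the two-step scaling, the same arithmetic on a few illustrative scores shows the minimum landing on exactly -1 and the maximum on exactly 1:

```python
# Min-max scale a few illustrative AFINN-style scores into [-1, 1]
scores = [38.0, 12.0, 0.0, -16.0]
lo, hi = min(scores), max(scores)
rescaled = [2 * (x - lo) / (hi - lo) - 1 for x in scores]
print(rescaled)  # approximately [1.0, 0.037, -0.407, -1.0]
```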
Moving on to analyzing the Extracted Features, I first imported FeatureReader from the htrc_features library.
from htrc_features import FeatureReader
import warnings
# The warnings are suppressed to avoid clutter in the output and do not affect the program's functionality
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
paths = ['mdp.39015062136661.json.bz2']
fr = FeatureReader(paths)
vol = next(fr.volumes())
Then, I created a DataFrame for Extracted Features named dumas_ef. I also grouped the tokens by page number, so that I could analyze the content on each page in a way similar to full text.
dumas_ef = vol.tokenlist(pos=False, case=False) \
    .reset_index().drop(['section'], axis=1)
dumas_ef.columns = ['Page Number', 'token', 'count']

# Group tokens by page number
grouped_tokens = dumas_ef.groupby('Page Number')
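To see what the grouping sets up, here is one way to rebuild an approximate page text from token counts, run on a toy token table (invented values, in the same page/token/count shape):

```python
import pandas as pd

# Toy token list in the same shape as the EF tokenlist after renaming columns
toy = pd.DataFrame({
    'Page Number': [1, 1, 2],
    'token': ['good', 'day', 'bad'],
    'count': [2, 1, 1],
})

pages = {}
for page, group in toy.groupby('Page Number'):
    # Repeat each token `count` times, separated by spaces
    pages[page] = " ".join(w for w, c in zip(group['token'], group['count']) for _ in range(c))

print(pages[1])  # prints "good good day"
```

The reconstructed text loses word order, since the token list only records counts per page, but lexicon-based scorers like VADER and AFINN are largely insensitive to order, which is what makes EF data usable here at all.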
The next section of code applies all three tools, VADER, TextBlob, and AFINN, to the Extracted Features.
# Initialize lists to store sentiment analysis results
ef_vader_sentiment_score = []
ef_textblob_sentiment_score = []
ef_afinn_sentiment_score = []

# Perform sentiment analysis for each page
for name, group in grouped_tokens:
    # Repeat each token `count` times, separated by spaces, to approximate the page text
    page_text = " ".join(word for word, count in zip(group['token'], group['count']) for _ in range(count))
    # VADER Analysis
    sentiment_scores = analyzer.polarity_scores(page_text)
    ef_vader_sentiment_score.append(sentiment_scores['compound'])
    # TextBlob Analysis
    sentiment = TextBlob(page_text).sentiment
    ef_textblob_sentiment_score.append(sentiment.polarity)
    # AFINN Analysis
    sentiment_score = afinn.score(page_text)
    ef_afinn_sentiment_score.append(sentiment_score)
# Create a DataFrame with all sentiment analysis results
dumas_ef = pd.DataFrame({
    'page_number': [int(x) for x in grouped_tokens.groups.keys()],
    'ef_vader_sentiment_score': ef_vader_sentiment_score,
    'ef_textblob_sentiment_score': ef_textblob_sentiment_score,
    'ef_afinn_sentiment_score': ef_afinn_sentiment_score
})
dumas_ef[10:15]
|   | page_number | ef_vader_sentiment_score | ef_textblob_sentiment_score | ef_afinn_sentiment_score |
|---|---|---|---|---|
| 10 | 18 | 0.9524 | 0.155742 | 0.0 |
| 11 | 19 | 0.8484 | 0.156593 | 2.0 |
| 12 | 20 | 0.6795 | 0.133160 | 4.0 |
| 13 | 21 | 0.8987 | 0.108465 | 10.0 |
| 14 | 22 | 0.9511 | 0.008393 | 12.0 |
Again, I needed to normalize the AFINN scores.
min_value = min(dumas_ef['ef_afinn_sentiment_score'])
max_value = max(dumas_ef['ef_afinn_sentiment_score'])
normalized_numbers = [(x - min_value) / (max_value - min_value) for x in dumas_ef['ef_afinn_sentiment_score']]
ef_afinn_normalized = [2 * x - 1 for x in normalized_numbers]
dumas_ef['ef_afinn_normalized'] = ef_afinn_normalized
dumas_ef[['ef_afinn_sentiment_score', 'ef_afinn_normalized']][10:15]
|   | ef_afinn_sentiment_score | ef_afinn_normalized |
|---|---|---|
| 10 | 0.0 | -0.075269 |
| 11 | 2.0 | -0.032258 |
| 12 | 4.0 | 0.010753 |
| 13 | 10.0 | 0.139785 |
| 14 | 12.0 | 0.182796 |
Another adjustment concerned the page numbers in the dumas_ef DataFrame. The original page numbers in dumas_ef were higher because they included the book's front matter, such as the cover and copyright pages. Subtracting 16, the number of front-matter pages in this book, ensured that the page numbers in dumas_ef match those in dumas_full_text for meaningful comparison.
FRONT_MATTER_PAGES = 16
dumas_ef['page_number'] = dumas_ef['page_number'].astype(int) - FRONT_MATTER_PAGES
dumas_ef[10:15]
|   | page_number | ef_vader_sentiment_score | ef_textblob_sentiment_score | ef_afinn_sentiment_score | ef_afinn_normalized |
|---|---|---|---|---|---|
| 10 | 2 | 0.9524 | 0.155742 | 0.0 | -0.075269 |
| 11 | 3 | 0.8484 | 0.156593 | 2.0 | -0.032258 |
| 12 | 4 | 0.6795 | 0.133160 | 4.0 | 0.010753 |
| 13 | 5 | 0.8987 | 0.108465 | 10.0 | 0.139785 |
| 14 | 6 | 0.9511 | 0.008393 | 12.0 | 0.182796 |
My goal was to visualize the overall emotional trend throughout the book, so fine-grained detail was not necessary. I therefore applied a rolling mean with a window size of 20 pages to smooth the data.
WINDOW_SIZE = 20

# Columns in dumas_full_text for rolling window operation
columns_full_text = ['vader_sentiment_score', 'textblob_sentiment_score', 'afinn_normalized']
for col in columns_full_text:
    dumas_full_text[col] = dumas_full_text[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()

# Columns in dumas_ef for rolling window operation
columns_ef = ['ef_vader_sentiment_score', 'ef_textblob_sentiment_score', 'ef_afinn_normalized']
for col in columns_ef:
    dumas_ef[col] = dumas_ef[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()
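The min_periods=1 argument is what keeps the first pages from becoming NaN: pandas averages over however many values are available until the window fills. A toy series shows the effect:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 5.0, 7.0])
# With min_periods=1, early positions are averaged over the values seen so far,
# so the smoothed series has no leading NaNs
smoothed = s.rolling(window=3, min_periods=1).mean()
print(smoothed.tolist())  # [1.0, 2.0, 3.0, 5.0]
```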
Finally, I used plotly to create an interactive graph with all the processed data. The interactive graph enables the viewer to toggle which graph(s) they would like to see and allows for a more straightforward comparison.
fig = go.Figure()

# Plotting for dumas_full_text DataFrame
sentiment_scores_full_text = {
    'vader_sentiment_score': 'Full Text VADER',
    'textblob_sentiment_score': 'Full Text TextBlob',
    'afinn_normalized': 'Full Text AFINN Normalized'
}
for column, label in sentiment_scores_full_text.items():
    fig.add_trace(go.Scatter(x=dumas_full_text['page_number'], y=dumas_full_text[column], mode='lines', name=label))

# Plotting for dumas_ef DataFrame
sentiment_scores_ef = {
    'ef_vader_sentiment_score': 'EF VADER',
    'ef_textblob_sentiment_score': 'EF TextBlob',
    'ef_afinn_normalized': 'EF AFINN Normalized'
}
for column, label in sentiment_scores_ef.items():
    fig.add_trace(go.Scatter(x=dumas_ef['page_number'], y=dumas_ef[column], mode='lines', name=label))

# Additional plot settings
fig.update_layout(
    title='Emotional Valence throughout "The Count of Monte Cristo"',
    xaxis_title='Page Number',
    yaxis_title='Sentiment Score',
    showlegend=True
)
fig.show()
Moving on to the nonfiction example: The Origin of Species. The process was largely the same, with only a few minor differences.
In this book's full-text file, the page numbering resets after page 400, marking the start of Part 2. To ensure a continuous page count throughout the book, I implemented an offset that allows the numbering to continue past 400 instead of restarting.
with open('hvd-hw39sc-1696432701.txt', 'r', encoding='utf-8') as file:
    text = file.read()
lines = text.split('\n')

page_numbers = []
page_content = []
current_page_number = None
current_page_content = []
offset = 0  # Initialize an offset

page_pattern = r'## p\. (\d+)'
for line in lines:
    if line.startswith("## p. "):
        if current_page_number is not None:
            page_numbers.append(current_page_number)
            page_content.append(" ".join(current_page_content))
        match = re.match(page_pattern, line)
        if match:
            # Check for page number reset
            if int(match.group(1)) == 1 and current_page_number is not None:
                offset += current_page_number  # Update the offset with the last page number
            current_page_number = int(match.group(1)) + offset
            current_page_content = []
    else:
        current_page_content.append(line)

# Store the final page, which the loop above never appends
if current_page_number is not None:
    page_numbers.append(current_page_number)
    page_content.append(" ".join(current_page_content))
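To sanity-check the offset logic in isolation, the same idea can be run on an illustrative sequence of parsed page numbers that resets partway through:

```python
# Illustrative parsed page numbers, resetting to 1 partway through
raw_pages = [399, 400, 1, 2, 3]

offset = 0
continuous = []
last = None
for p in raw_pages:
    if p == 1 and last is not None:
        offset += last  # numbering restarted; continue from the last adjusted page
    adjusted = p + offset
    continuous.append(adjusted)
    last = adjusted

print(continuous)  # [399, 400, 401, 402, 403]
```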
As with the fiction example, I created a DataFrame for the full text named darwin_full_text.
darwin_full_text = pd.DataFrame({
    'page_number': [int(x) for x in page_numbers],
    'page_content': page_content,
})
darwin_full_text[10:15]
|   | page_number | page_content |
|---|---|---|
| 10 | 11 | A HISTORICAL SKETCH OF THE PROGRESS OF OPINIO... |
| 11 | 12 | 12 HISTORICAL SKETCH times has treated it in ... |
| 12 | 13 | HISTORICAL SKETCH 13 he maintains that such f... |
| 13 | 14 | 14 HISTORICAL SKETCH tinctly recognizes the p... |
| 14 | 15 | HISTORICAL SKETCH 15 The Hon. and Rev. W. Her... |
analyzer = SentimentIntensityAnalyzer()
darwin_full_text['vader_sentiment_score'] = 0.0
darwin_full_text['vader_sentiment'] = ""

for row in darwin_full_text.itertuples():
    sentence = row.page_content
    sentiment_dictionary = analyzer.polarity_scores(sentence)
    compound = sentiment_dictionary['compound']
    darwin_full_text.at[row.Index, 'vader_sentiment_score'] = compound
    if compound >= 0.33:
        vader_sentiment = "Positive"
    elif compound <= -0.33:
        vader_sentiment = "Negative"
    else:
        vader_sentiment = "Neutral"
    darwin_full_text.at[row.Index, 'vader_sentiment'] = vader_sentiment

darwin_full_text['textblob_sentiment_score'] = 0.0
darwin_full_text['textblob_sentiment'] = ""

for row in darwin_full_text.itertuples():
    sentence = row.page_content
    classifier = TextBlob(sentence)
    polarity = classifier.sentiment.polarity
    darwin_full_text.at[row.Index, 'textblob_sentiment_score'] = polarity
    if polarity >= 0.1:
        textblob_sentiment = "Positive"
    elif polarity <= -0.1:
        textblob_sentiment = "Negative"
    else:
        textblob_sentiment = "Neutral"
    darwin_full_text.at[row.Index, 'textblob_sentiment'] = textblob_sentiment

afinn = Afinn(language='en')
darwin_full_text['afinn_sentiment_score'] = 0.0

for row in darwin_full_text.itertuples():
    sentence = row.page_content
    score = afinn.score(sentence)
    darwin_full_text.at[row.Index, 'afinn_sentiment_score'] = score
# Normalize AFINN scores
min_value = min(darwin_full_text['afinn_sentiment_score'])
max_value = max(darwin_full_text['afinn_sentiment_score'])
normalized_numbers = [(x - min_value) / (max_value - min_value) for x in darwin_full_text['afinn_sentiment_score']]
afinn_normalized = [2 * x - 1 for x in normalized_numbers]
darwin_full_text['afinn_normalized'] = afinn_normalized
darwin_full_text[10:15]
|   | page_number | page_content | vader_sentiment_score | vader_sentiment | textblob_sentiment_score | textblob_sentiment | afinn_sentiment_score | afinn_normalized |
|---|---|---|---|---|---|---|---|---|
| 10 | 11 | A HISTORICAL SKETCH OF THE PROGRESS OF OPINIO... | 0.9172 | Positive | 0.092385 | Neutral | 12.0 | 0.062500 |
| 11 | 12 | 12 HISTORICAL SKETCH times has treated it in ... | 0.9807 | Positive | 0.151984 | Positive | 20.0 | 0.229167 |
| 12 | 13 | HISTORICAL SKETCH 13 he maintains that such f... | 0.5499 | Positive | 0.043086 | Neutral | 1.0 | -0.166667 |
| 13 | 14 | 14 HISTORICAL SKETCH tinctly recognizes the p... | 0.9829 | Positive | 0.157964 | Positive | 7.0 | -0.041667 |
| 14 | 15 | HISTORICAL SKETCH 15 The Hon. and Rev. W. Her... | 0.8620 | Positive | 0.028126 | Neutral | 7.0 | -0.041667 |
paths = ['hvd.hw39sc.json.bz2']
fr = FeatureReader(paths)
vol = next(fr.volumes())
darwin_ef = vol.tokenlist(pos=False, case=False) \
    .reset_index().drop(['section'], axis=1)
darwin_ef.columns = ['Page Number', 'token', 'count']
grouped_tokens = darwin_ef.groupby('Page Number')
ef_vader_sentiment_score = []
ef_textblob_sentiment_score = []
ef_afinn_sentiment_score = []

for name, group in grouped_tokens:
    # Repeat each token `count` times, separated by spaces, to approximate the page text
    page_text = " ".join(word for word, count in zip(group['token'], group['count']) for _ in range(count))
    # VADER Analysis
    sentiment_scores = analyzer.polarity_scores(page_text)
    ef_vader_sentiment_score.append(sentiment_scores['compound'])
    # TextBlob Analysis
    sentiment = TextBlob(page_text).sentiment
    ef_textblob_sentiment_score.append(sentiment.polarity)
    # AFINN Analysis
    sentiment_score = afinn.score(page_text)
    ef_afinn_sentiment_score.append(sentiment_score)
# Create a DataFrame with all sentiment analysis results
darwin_ef = pd.DataFrame({
    'page_number': [int(x) for x in grouped_tokens.groups.keys()],
    'ef_vader_sentiment_score': ef_vader_sentiment_score,
    'ef_textblob_sentiment_score': ef_textblob_sentiment_score,
    'ef_afinn_sentiment_score': ef_afinn_sentiment_score
})
# Normalize AFINN scores for EF
min_value = min(darwin_ef['ef_afinn_sentiment_score'])
max_value = max(darwin_ef['ef_afinn_sentiment_score'])
normalized_numbers = [(x - min_value) / (max_value - min_value) for x in darwin_ef['ef_afinn_sentiment_score']]
ef_afinn_normalized = [2 * x - 1 for x in normalized_numbers]
darwin_ef['ef_afinn_normalized'] = ef_afinn_normalized
darwin_ef[10:15]
|   | page_number | ef_vader_sentiment_score | ef_textblob_sentiment_score | ef_afinn_sentiment_score | ef_afinn_normalized |
|---|---|---|---|---|---|
| 10 | 15 | 0.7574 | -0.021173 | 5.0 | -0.106383 |
| 11 | 16 | 0.8779 | 0.082906 | -1.0 | -0.361702 |
| 12 | 17 | 0.8497 | 0.121780 | 4.0 | -0.148936 |
| 13 | 18 | 0.9545 | 0.183201 | 18.0 | 0.446809 |
| 14 | 19 | 0.0516 | 0.127521 | 1.0 | -0.276596 |
# Subtract the number of pages in front matter from EF DataFrame
FRONT_MATTER_PAGES = 16
darwin_ef['page_number'] = darwin_ef['page_number'].astype(int) - FRONT_MATTER_PAGES
# Smoothing the graph with rolling mean
WINDOW_SIZE = 20
columns_full_text = ['vader_sentiment_score', 'textblob_sentiment_score', 'afinn_normalized']
for col in columns_full_text:
    darwin_full_text[col] = darwin_full_text[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()
columns_ef = ['ef_vader_sentiment_score', 'ef_textblob_sentiment_score', 'ef_afinn_normalized']
for col in columns_ef:
    darwin_ef[col] = darwin_ef[col].rolling(window=WINDOW_SIZE, min_periods=1).mean()
fig = go.Figure()

# Plotting for darwin_full_text DataFrame
sentiment_scores_full_text = {
    'vader_sentiment_score': 'Full Text VADER',
    'textblob_sentiment_score': 'Full Text TextBlob',
    'afinn_normalized': 'Full Text AFINN Normalized'
}
for column, label in sentiment_scores_full_text.items():
    fig.add_trace(go.Scatter(x=darwin_full_text['page_number'], y=darwin_full_text[column], mode='lines', name=label))

# Plotting for darwin_ef DataFrame
sentiment_scores_ef = {
    'ef_vader_sentiment_score': 'EF VADER',
    'ef_textblob_sentiment_score': 'EF TextBlob',
    'ef_afinn_normalized': 'EF AFINN Normalized'
}
for column, label in sentiment_scores_ef.items():
    fig.add_trace(go.Scatter(x=darwin_ef['page_number'], y=darwin_ef[column], mode='lines', name=label))

# Additional plot settings
fig.update_layout(
    title='Emotional Valence throughout "The Origin of Species"',
    xaxis_title='Page Number',
    yaxis_title='Sentiment Score',
    showlegend=True
)
fig.show()
During my project, I experimented with sentiment analysis using Large Language Models (LLMs) such as BERTweet and SiEBERT. However, the approach encountered several challenges. For the full text analysis, the length of content on each page often exceeded the token limit of these models. I attempted to segment the page content into individual sentences using spaCy, but occasionally, even these sentences were too lengthy. Truncating sentences to fit within the token limit compromised the accuracy of the analysis.
In the case of Extracted Features, the use of LLMs proved impractical. The token lists comprised isolated words without context, which contradicts the advantage of LLMs: analyzing more extended sentences or texts to understand the overall sentiment. Additionally, the computational demands of running LLMs exceeded the capabilities of my available hardware.
Given these limitations and the challenges, I ultimately decided against including LLMs in the final iteration of my project.
In analyzing the graphs for both the fiction and nonfiction examples, several key observations emerged.
These findings highlight the distinct characteristics and tendencies of each sentiment analysis tool when applied to both fiction and nonfiction works.
In closing, I would like to express my gratitude to Glen Layne-Worthey for his invaluable guidance and encouragement throughout this project. I am also deeply thankful to Ryan Dubnicek for sharing his technical expertise, which greatly aided this work.
Bowers, Katherine and Quinn Dombrowski. “Katia and the Sentiment Snobs”. The Data-Sitters Club. October 25, 2021. https://datasittersclub.github.io/site/dsc11.html.
Organisciak, Peter and Boris Capitanu. "Text Mining in Python through the HTRC Feature Reader." Programming Historian 5 (2016). https://doi.org/10.46430/phen0058.